In an earlier report, the effects of several transmission systems on
speaker verification by human listeners were investigated. It was 
shown that the transmission system played a significant role in the 
speaker verification process. In this paper, we show the effects of the 
transmission system on an existing automatic speaker verification 
system in which the measured features are pitch and gain as a 
function of time for a specified utterance. In this experiment, there 
were 10 male and 10 female customers and 40 male and 40 female 
impostors. Fifty utterances per customer were recorded using a conventional
telephone connection over a period of two months. All utterances were
post-processed by an adpcm coding system and an lpc vocoding system.
When the reference and test utterances were subjected to different 
transmission systems, no significant difference in the verification 
accuracy of this automatic system was found. This result verifies that 
pitch and gain are robust features for use in a speaker verification 
system. 

I. INTRODUCTION 

The automatic speaker verification problem has two aspects — the 
creation of a reference pattern and the determination of similarity 
between a test and a reference pattern. When verification is performed 
over dialed-up telephone lines, the transmission system used in the 
telephone plant is an additional factor that must be considered. In a 
recent subjective experiment, 1 the effects of adaptive differential pulse
code modulation (adpcm) coding and linear predictive vocoding (lpc)
on the speaker verification accuracy of human listeners were investigated.
It was shown that the verification task was easiest (most
accurate) when homogeneous systems were used (i.e., the test and
reference utterances were transmitted over the same system) and
significantly more difficult (higher error rate) when mixed systems
(i.e., different transmission systems) were used for the test and
reference utterances. In this paper, we investigate how these same
transmission systems affect machine verification accuracy using a system
that has been studied for the past few years at Bell Laboratories. 2-6

The automatic system is based on the analysis of fixed sentence-long 
utterances in which the verification features are the time variations 
(contours) of the pitch and gain (intensity) of the utterance. A training 
set of utterances (both customer and impostor) is required to establish 
a reference pattern and to choose weights and measurements for the 
verification process. Following time alignment of the reference and 
test contours, a combination of weighted Euclidean distances between
a set of test and reference measurements is compared with a threshold 
to determine whether to accept or reject an identity claim. 

In an extensive investigation of the automatic speaker verification 
system over dialed-up telephone lines, Rosenberg obtained an average 
verification accuracy of about 91 percent. 2 Rosenberg also found that 
some talkers tended to perform significantly worse than average, and 
some significantly better than average. 

To investigate the behavior of the automatic speaker verification 
system on different transmission systems, a new data base of customer 
and impostor utterances was created. Dialed-up telephone connections 
were used in all recordings. Since the earlier work on human verifica- 
tion used wideband, high-quality recordings, the experiments with 
human listeners were repeated using the new data base. Following 
this, a series of experiments was run with the automatic verification 
system. 

The key results of this study are: 

(i) Human verification accuracy on the telephone speech was essen- 
tially the same as previously reported for the high-quality speech, i.e., 
the highest verification scores were obtained when the reference and 
test utterances were transmitted over the same system, and signifi- 
cantly lower verification scores were obtained when different trans- 
mission systems were used for the test and reference utterances. 

(ii) Machine verification accuracy on the telephone speech was 
essentially independent of the transmission system used for the test 
and reference utterances. 

These results tend to confirm the notion that pitch and gain are 
robust features for verification and hence are suitable for many appli- 
cations. 

The organization of this paper is as follows. The automatic speaker 
verification system is described in Section II. Section III describes the 
experimental procedure used to evaluate the automatic verification 
system. This section includes a description of the speech transmission 
systems, as well as the data base used in the evaluation. The machine
verification and the human verification results are presented in Sec- 
tions IV and V. Finally, in Section VI the main results of the experi- 
ment are discussed. 

II. THE AUTOMATIC SPEAKER VERIFICATION SYSTEM 

Although the operation of the automatic speaker verification system 
has been described previously, 2-6 a brief review is given here. A block
diagram of the overall verification system is shown in Fig. 1. Two 
inputs are provided to the system. These are an identity claim which 
retrieves reference data associated with the claimed identity and a 
sentence-long sample utterance. The sample utterance is analyzed to 
extract time functions or contours of specified features which are 
compared with (previously obtained) reference contours. Reference 
contours are obtained by averaging and combining sets of contours 
obtained from training utterances from the individual whose identity 
is claimed. The features used in this experiment are the intensity 
(gain) and pitch period. The gain contour is normalized so that its peak
over the entire utterance is a fixed value and is low-pass filtered to
give a smooth contour.

Fig. 1 — Flow diagram of the verification system.
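The gain-contour conditioning described above can be sketched as follows. This is a minimal illustration, not the system's actual processing: the source does not specify the fixed peak value, the low-pass filter, or the order of the two steps, so the unit peak, the moving-average filter, the window width, and the function names are all assumptions.

```python
def normalize_peak(contour, peak=1.0):
    """Scale a gain contour so its maximum over the utterance equals `peak`.

    The target value `peak` is an assumed parameter; the paper only says
    the peak is set to "a fixed value".
    """
    top = max(contour)
    if top <= 0:
        raise ValueError("gain contour must have a positive peak")
    return [peak * g / top for g in contour]


def moving_average(contour, width=5):
    """Smooth a contour with a centered moving average.

    A moving average stands in for the unspecified low-pass filter.
    The window is shortened at the two ends of the contour.
    """
    half = width // 2
    smoothed = []
    for n in range(len(contour)):
        window = contour[max(0, n - half):n + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Illustrative (made-up) gain values, one per analysis frame.
gain = [0.0, 2.0, 8.0, 4.0, 6.0, 1.0, 0.0]
normalized = normalize_peak(gain)
smooth = moving_average(normalized, width=3)
```

Normalizing the peak removes overall level differences between recordings, so that only the shape of the intensity contour enters the later comparison.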

Before comparing sample and reference contours, time registration 
is carried out. Using a dynamic programming technique, the sample 
intensity contour is time-warped so that corresponding events in the 
sample and reference contours are aligned in time. The resulting time- 
warping function is also applied to the sample pitch period contour in 
order to align it to the reference contour. Following registration, the 
contours are divided into 20 equal length segments. In each segment, 
a set of measurements is applied to both the sample and reference 
contours. A squared difference is calculated specifying the dissimilarity 
between contours for each measurement and weighted inversely by a 
variance which is calculated from the set of training contours used to 
construct the reference. The effect of using the variance is to weight 
more heavily those segments in which a particular measurement is 
consistent over the set of training contours. 4 The various segment-by- 
segment measurements characterize the shape of the contours. In 
addition, the system computes distances based on the overall cross 
correlation of sample and reference contours (after time alignment) 
and distances based on the amount of warping required to register the
sample contours to the reference contours. 
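The time registration step can be illustrated with a basic dynamic time warping sketch. It assumes a squared-difference local cost and the simplest step pattern (diagonal, vertical, horizontal); the path constraints of the actual system are not given in this paper. The function names and contour values are hypothetical.

```python
def dtw_path(sample, reference):
    """Align `sample` to `reference` by dynamic programming.

    Returns (total_cost, path), where path is a list of
    (sample_index, reference_index) pairs from start to end.
    """
    n, m = len(sample), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = (sample[i - 1] - reference[j - 1]) ** 2
            cost[i][j] = local + min(cost[i - 1][j - 1],
                                     cost[i - 1][j],
                                     cost[i][j - 1])
    # Backtrack from the end point to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def warp_to_reference(contour, path, ref_len):
    """Resample `contour` onto the reference time axis using `path`.

    Sample frames mapped to the same reference frame are averaged.
    """
    buckets = [[] for _ in range(ref_len)]
    for sample_idx, ref_idx in path:
        buckets[ref_idx].append(contour[sample_idx])
    return [sum(b) / len(b) for b in buckets]


# Made-up contours: the sample is a time-stretched copy of the reference.
reference_gain = [0, 1, 2, 1, 0]
sample_gain = [0, 0, 1, 2, 2, 1, 0]
sample_pitch = [80, 80, 90, 100, 100, 90, 80]

total, path = dtw_path(sample_gain, reference_gain)
warped_gain = warp_to_reference(sample_gain, path, len(reference_gain))
# The path found on the intensity contour is reused for the pitch contour.
warped_pitch = warp_to_reference(sample_pitch, path, len(reference_gain))
```

As in the text, the warping function is computed from the intensity contour only and then applied unchanged to the pitch period contour.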

These distances are combined into an overall distance in two differ- 
ent ways. The first, the "overall distance" procedure, is a simple 
(unweighted) average over the entire set of individual distances. In the 
second procedure, the "selected distance" is a simple average calcu- 
lated over a prespecified speaker-dependent subset of the entire set of 
distances. The subset is obtained as part of the training procedure by 
selecting those distances which are most effective in separating popu- 
lations of customer and impostor utterances. 
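The variance-weighted distances and the two ways of combining them can be sketched as below. The measurement values, variances, and the selected subset are made up for illustration, and the function names are hypothetical; in the system described here the subset comes from the training procedure.

```python
from statistics import mean


def weighted_distances(sample_meas, ref_meas, ref_var):
    """Squared difference per measurement, weighted inversely by the
    variance observed over the training contours."""
    return [(s - r) ** 2 / v
            for s, r, v in zip(sample_meas, ref_meas, ref_var)]


def overall_distance(distances):
    """Simple (unweighted) average over the entire set of distances."""
    return mean(distances)


def selected_distance(distances, subset):
    """Simple average over a prespecified, speaker-dependent subset."""
    return mean(distances[k] for k in subset)


# Illustrative values only.
sample = [1.0, 2.0, 4.0]
reference = [1.0, 1.0, 2.0]
variance = [1.0, 0.5, 4.0]

d = weighted_distances(sample, reference, variance)
d_overall = overall_distance(d)
d_selected = selected_distance(d, subset=[1])
```

Dividing by the training variance means that a measurement the talker reproduces consistently (small variance) contributes more to the distance than one that varies from session to session.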

For either procedure, the combined distance is compared with a 
speaker-dependent threshold to determine whether to accept or reject 
the identity claim. The threshold distance, obtained from the training
set and included in the reference data, is estimated from the
distributions of combined distances for customer and impostor training
utterances. Normally, a threshold is chosen to equalize the false
(impostor) acceptance and false (customer) rejection rates. In the 
absence of direct knowledge of the costs of rejecting a customer or 
accepting an impostor, setting the threshold to give equal error rates 
yields the minimum cost. This is called the equal error criterion in this 
paper. In many real-world applications, the costs of these two types of 
errors would not be equal — e.g., in a banking situation the cost of 
rejecting a customer would be lower than the cost of accepting an 
impostor. In such cases, the threshold would be adjusted appropriately. 
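A minimal sketch of the equal error criterion follows. It assumes that a claim is accepted when the combined distance does not exceed the threshold and that candidate thresholds are the observed training distances; the midpoint rule for non-overlapping distributions follows the description given later in Section 3.1. The distances and function names are illustrative.

```python
def error_rates(customer, impostor, threshold):
    """Return (false rejection rate, false acceptance rate).

    A claim is accepted when its distance is <= threshold (an assumed
    convention).
    """
    false_reject = sum(d > threshold for d in customer) / len(customer)
    false_accept = sum(d <= threshold for d in impostor) / len(impostor)
    return false_reject, false_accept


def equal_error_threshold(customer, impostor):
    """Pick the threshold at which the two error rates are closest."""
    if max(customer) < min(impostor):
        # Separated distributions: take the point midway between them.
        return (max(customer) + min(impostor)) / 2

    def imbalance(t):
        fr, fa = error_rates(customer, impostor, t)
        return abs(fr - fa)

    candidates = sorted(set(customer) | set(impostor))
    return min(candidates, key=imbalance)


# Made-up training distances.
customer_d = [1, 2, 3, 6]
impostor_d = [4, 5, 7, 8]
threshold = equal_error_threshold(customer_d, impostor_d)
rates = error_rates(customer_d, impostor_d, threshold)
```

With only a handful of training distances the error rates change in coarse steps, which is one reason a threshold chosen this way can be unstable on new data.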

III. EXPERIMENTAL EVALUATION

To evaluate the automatic verification system of the previous sec- 
tion, a speech data base of utterances was created. The experimental 
setup for creating this speech data base is shown in Fig. 2. The speech 
was recorded in a sound booth over conventional dialed-up telephone 
lines. The signal was bandlimited from 100 to 3200 Hz (the nominal 
telephone bandwidth) and digitized at a 10-kHz rate. Both the refer- 
ence and the test utterances were processed by one of the following 
three transmission systems: 

(i) Clear channel — i.e., no additional processing. 

(ii) Adaptive differential pulse code modulation (adpcm) coding.

(iii) Linear predictive vocoding (lpc).

The adpcm coder used in this experiment was a simulation of the 
coder built by Bates, 7 based on the work of Cummiskey et al. 8 Figure 
3 is a block diagram of the adpcm system. Since the required sampling 
rate for the adpcm coder was 6 kHz, a sampling rate conversion system 
was used to convert from 10 to 6 kHz at the input to the coder. 9 The 
signal bandwidth was reduced to 2600 Hz for the adpcm coder by using 
a 100- to 2600-Hz bandpass filter in the sampling rate conversion 
system. In the coder, a 4-bit adaptive quantizer was used to code 
the difference signal (S(n) in Fig. 3), giving an overall bit rate of 
24 kbits/s for the coder. The step-size multiplier of the quantizer
ranged over a 41-dB range (i.e., the ratio between the largest and
smallest step sizes was 114 to 1). A first-order predictor was used with a multiplier
coefficient of a = 0.9375. Signal levels were chosen so that the coder 
was operating at approximately the optimum range. 8 
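The coder figures quoted above are mutually consistent, as the short check below shows. It assumes that the decibel range is an amplitude ratio, i.e., 20 log10 of the 114-to-1 step-size ratio; the variable names are illustrative.

```python
import math

sampling_rate_hz = 6000   # adpcm coder sampling rate
bits_per_sample = 4       # 4-bit adaptive quantizer

# Bit rate = samples per second times bits per sample.
bit_rate = sampling_rate_hz * bits_per_sample

# Step-size range in dB, assuming an amplitude (20 log10) ratio.
step_ratio = 114
range_db = 20 * math.log10(step_ratio)

# The predictor coefficient 0.9375 is exactly 15/16.
predictor_coefficient = 15 / 16
```

Here `bit_rate` is 24000 bits/s, matching the stated 24 kbits/s, and `range_db` is about 41.1 dB, matching the stated 41-dB range.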

Fig. 2 — Block diagram of the data collection system.

Fig. 3 — Block diagram of the adpcm coder.

A block diagram of the lpc vocoder is given in Fig. 4. The implementation
was based on the autocorrelation method of linear prediction. 10-12
Pitch detection and voiced-unvoiced decision were performed
using the modified autocorrelation pitch detector of Dubnowski et al. 13
A 12-pole lpc analysis was performed using a pitch-adaptive, variable
frame size, at a rate of 100 frames per second. 14 No quantization of the
lpc parameters was used in this experiment.
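For readers unfamiliar with the autocorrelation method, the sketch below shows the standard Levinson-Durbin recursion that solves for the predictor coefficients. It is a generic textbook formulation, not the vocoder's code: windowing, the pitch-adaptive frame size, and the handling of silent frames (zero energy would cause a division by zero here) are omitted, and the function names are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations.

    Returns (a, err): predictor coefficients a[0..order-1] such that
    x[n] is predicted by sum(a[k-1] * x[n-k] for k = 1..order), and the
    final prediction error energy.
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / err
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        err *= 1.0 - reflection * reflection
    return a[1:], err


# Autocorrelation of a first-order process with coefficient 0.5.
r = [1.0, 0.5, 0.25]
coeffs, residual = levinson_durbin(r, order=2)
```

For the 12-pole analysis described in the text, the same routine would be called with `order=12` on the autocorrelation of each frame.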

To evaluate the effects of the three transmission systems on verifi- 
cation accuracy, a data base was designed which included: 

(i) Fifty recordings made by each of 20 experienced talkers (10 male 
and 10 female) over a period of two months. The first 10 recordings 
were made once a day; the remaining 40 were made twice a day 
(morning and afternoon). These talkers were designated "customers." 

(ii) One recording made by each of 80 naive talkers (40 male and 40 
female). These talkers were designated "impostors." There was no 
attempt to mimic the "customers." 

Two all-voiced sentences were used in the recordings. The males used 
the sentence, "We were away a year ago," and the females used the 
sentence, "I know when my lawyer is due." In previous studies, only 
the first sentence was used. 2-5 

Since the automatic speaker verification system used pitch and 
intensity contours as features, these contours were measured once and 
stored on disk for later retrieval in the experiment. 

3.1 Reference construction 

Fig. 4 — Block diagram of the lpc vocoder.

For each customer and each transmission system, pitch and intensity
contours from 10 of the 50 utterances were used to construct "reference"
contours. As discussed previously, in order to complete the
reference construction (i.e., to get weights, thresholds, and to choose
selected distances), 10 additional utterances of the 50 were used, along
with the pitch and intensity contours of 15 impostors of the same sex
as the customer. Two different sets of reference files were created for
each customer — one obtained from the first 20 consecutive recordings
of the customer (method 1) and the other obtained by using two of
every five utterances recorded (method 2).

Figure 5 is a series of plots of relative cumulative frequency 
distributions of customer and impostor samples as a function of com- 
bined distance. Each column contains the results for two female and 
two male customers using the (method 1) training data. In each plot, 
the customer sample distributions (on the left) show the fraction of 
samples with distances greater than the abscissa value while the 
impostor sample distributions (on the right) show the fraction of 
samples with distances less than the abscissa value. The first column 
shows the results for clear channel utterances; the second column 
shows results for lpc vocoded utterances, and the third column shows 
results for adpcm coded utterances. The decision threshold is chosen 
as the distance where the cumulative distributions cross (i.e., the equal 
error threshold), or, in the case when the distributions are separated, 
the point midway between the ends of the separate distributions. Since 
there was only a small number of training utterances, the distributions 
for the two types of errors are poorly defined and only a rough estimate 
of the decision threshold is obtained. It should be clear that the equal 
error threshold will vary for each pair of transmission systems being 
compared, as well as for different talkers. For the four talkers shown
in Fig. 5, the worst equal error threshold indicates a 10-percent error
rate; however, for the 20 talkers, the average equal error rate over all
conditions was about 1 percent. Figure 5 also shows that the value of
the threshold distance varies considerably (2.5 to 1) across talkers, as
does the shape of the cumulative distributions. These results tend to
indicate that the training data are inadequate for obtaining a good
estimate of the equal error rate threshold.

Fig. 5 — Plots of the cumulative distributions of the errors for four talkers and three
transmission systems based on the training set of utterances.

IV. RESULTS ON AUTOMATIC SPEAKER VERIFICATION 

To test the automatic speaker verification system for each trans- 
mission system, each customer reference was compared to the 30 
customer utterances and the 25 impostor utterances which were not 
used in the training set. For each set of comparisons, a Type 1
(customer rejection) and a Type 2 (impostor acceptance) error score
was measured. If we denote the Type 1 error scores as E1 and the Type
2 error scores as E2, then E1 and E2 are functions of:

(i) The transmission system used in the training, i, where i = 1 
denotes the clear channel, i = 2 denotes the lpc vocoder, and i = 3 
denotes the adpcm coder. The mnemonics C, V, and A are used in the 
plots to denote clear channel, lpc vocoder, and adpcm coder, respec- 
tively. 

(ii) The transmission system used in the testing, j, where j = 1, 2,
and 3 are identical to i = 1, 2, and 3.

(iii) The training method, k, where k = 1 denotes training method
1 and k = 2 denotes training method 2.

(iv) The type of measurements used in the verification distance, l,
where l = 1 is selected measurements (speaker specific), and l = 2 is
overall measurements (speaker independent).

(v) The customer, m, where m = 1 to 10 for the 10 male customers 
and m = 11 to 20 for the female customers. 

Since different sentences were used for male and female customers, 
results are presented separately for each subset of the customers. 

To illustrate some of the results, Fig. 6 shows plots of E1 and E2 as
a function of the training system, testing system pair (i, j), and talker
(m), for selected distance measurements (l = 1), and training method
2 (k = 2). Figures 6a and 6b are E1 scores for male customers (m = 1
to 10), and female customers (m = 11 to 20), and Figs. 6c and 6d are E2
scores for male and female customers. A bar graph denotes the error
score for each condition. The reader should note that within each
group there are 10 bars, some of which indicate zero error.
From this figure the following observations can be made: 

(i) E2 scores are significantly smaller than E1 scores, indicating
that the distance threshold obtained from the training set for equal
errors (E1 = E2) was not a stable point.

(ii) A high degree of variability in scores (for both E1 and E2 and for
male and female customers) exists among customers for each pair of
transmission systems. This type of result was also obtained earlier by
Rosenberg. 2

(iii) Error scores for female customers are somewhat smaller than
error scores for male customers.

(iv) The variability between scores for pairs of transmission systems
is smaller than the variability of scores within a pair of transmission
systems.

Fig. 6 — Plots of E1 and E2 versus system pair and talker for male and female talkers
(training method 2 using selected distance measurements).

Based on the above observations, it is clear that the E1 and E2 scores
cannot simply be averaged over customers because of the high varia- 
bility among customers. Thus two data processing procedures were 
tried, one to use the median over customers, the other to statistically 
eliminate the extremes of the distribution of error scores (using a 
recently described method 15 ) and then to average the scores of the 
remaining customers. Both data processing methods yielded essentially 
the same results and hence only the results of taking medians are 
presented here. 

Table I gives values of the medians of E1 and E2 over (male and
female) customers for each (i, j) pair, and for k = 1 and 2, and l = 1
and 2. Also included in this table is the median of the quantity

E3 = (E1 + E2)/2,

which is the average error rate of the system. Since E1 and E2 were
significantly different, E3 provides a better measure of the overall
performance of the verification system than either E1 or E2. It can be
seen from Table I and statistically verified at the 0.001 level that 
training method 2 provides significantly better scores than training 
method 1, and that using selected measurements for the distance score 
provides significantly better scores than the overall measurements. As 
such, we restrict our discussion to this case only, i.e., selected mea- 
surements from training method 2. 
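The reason for summarizing with a median rather than a mean can be seen in a small sketch. The error scores below are invented for illustration and are not values from Table I; only the formula E3 = (E1 + E2)/2 and the use of the median come from the text.

```python
from statistics import mean, median


def average_error(e1, e2):
    """E3 = (E1 + E2) / 2 for one customer and one system pair."""
    return (e1 + e2) / 2


# Made-up percent error scores for four customers; the last customer
# performs much worse than the others.
e1_scores = [0.0, 10.0, 20.0, 50.0]   # Type 1 (customer rejection)
e2_scores = [0.0, 2.0, 4.0, 10.0]     # Type 2 (impostor acceptance)

e3_scores = [average_error(a, b) for a, b in zip(e1_scores, e2_scores)]
e3_median = median(e3_scores)
e3_mean = mean(e3_scores)
```

In this example the single poor talker pulls the mean of E3 up to 12.0, while the median stays at 9.0, which is why a median is the more stable summary when customer-to-customer variability is high.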

Figure 7 is a plot of the E3 median scores for each pair of transmission
systems. Although there is some variability in score among these
systems, the variability is statistically insignificant (at the 0.01 level).
Thus the major result of the testing is that the verification accuracy is 
relatively insensitive to the transmission system used for training and 
testing. This result is very different from the one obtained when 
verification is performed by human listeners as discussed previously. 
To ensure that the human verification accuracy for this new data base 
remained the same, the perceptual verification experiment was re- 
peated, and the results are given in the next section. 

Before the results of the perceptual experiment are described,
Fig. 8 illustrates the variability of the distance threshold in the testing.
This figure shows the sum of the cumulative error distributions for the
testing results for several different cases. Each part of the figure 
contains three vertical lines. The solid vertical line is the a priori 
distance threshold which gives an equal error based on the training 
data. The dashed vertical line is the a posteriori distance threshold 
that gives equal error based on the testing data. The dotted vertical 
line is the threshold that minimizes the total error E3 for the testing
data. In the ideal case, all three thresholds would be equal. However, 
because of the inadequacy of the training data and the unusual shapes 
of the cumulative error distribution, thresholds were different, as seen 
in this figure. The variability in both distance thresholds and error 
scores is fairly large (typically, between 10 and 30 percent in most 
cases). 
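The difference between an equal error threshold and a minimum total error threshold can be sketched on a small set of test distances. The distances are invented, the accept-if-not-greater convention is an assumption, and candidate thresholds are restricted to the observed distances for simplicity; the function names are hypothetical.

```python
def error_rates(customer, impostor, threshold):
    """Return (E1, E2) as fractions: customer rejections and impostor
    acceptances, accepting a claim when distance <= threshold."""
    e1 = sum(d > threshold for d in customer) / len(customer)
    e2 = sum(d <= threshold for d in impostor) / len(impostor)
    return e1, e2


def min_total_error_threshold(customer, impostor):
    """Threshold that minimizes E1 + E2 on the given data."""
    candidates = sorted(set(customer) | set(impostor))
    return min(candidates,
               key=lambda t: sum(error_rates(customer, impostor, t)))


def equal_error_threshold(customer, impostor):
    """Threshold at which E1 and E2 are closest to each other."""
    candidates = sorted(set(customer) | set(impostor))

    def imbalance(t):
        e1, e2 = error_rates(customer, impostor, t)
        return abs(e1 - e2)

    return min(candidates, key=imbalance)


# Made-up test distances.
customer_d = [1, 2, 3, 6]
impostor_d = [4, 5, 7, 8]

t_min_total = min_total_error_threshold(customer_d, impostor_d)
t_equal = equal_error_threshold(customer_d, impostor_d)
total_at_min = sum(error_rates(customer_d, impostor_d, t_min_total))
total_at_equal = sum(error_rates(customer_d, impostor_d, t_equal))
```

Even on the same data the two criteria give different thresholds (3 and 4 here), and the equal error threshold carries the larger total error; a threshold fixed a priori from sparse training data adds a further source of mismatch.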

V. HUMAN VERIFICATION TEST RESULTS 

To verify that the human verification accuracy was strongly affected 
by the speech transmission system when using the telephone record- 
ings, the experiment performed in Ref. 1 was repeated exactly. The 
results of these tests are given in Figs. 9 and 10. Figure 9 shows false 
alarm (customer rejection or Type 1 error) and miss (impostor accept- 
ance or Type 2 error) rates for the male and female customers as a 
function of the pair of transmission systems used in the comparison. 
Only the first eight customers (female and male) were used to correspond
to the eight customers used in the earlier experiment. This
figure again shows the high degree of variability of the scores among
customers. Thus, to combine customer scores, a median was used
instead of averaging. Figure 10 shows the median false alarm rate, miss
rate, and overall error rate for each pair of speech transmission
systems.

Fig. 8 — Total error (E1 + E2) as a function of distance threshold for four conditions
for the testing data.

The results shown in Figs. 9 and 10 are essentially identical to those 
reported previously, 1 namely, that the false alarm rates for transmis- 
sion pairs which were the same were statistically significantly lower 
than for mixed systems, whereas the miss rates for homogeneous 
systems were larger than for mixed systems. Statistical comparisons 
showed that, in this experiment, the results were not statistically 
significantly different from those of the earlier experiment in any 
category. 



[Figure 9: bar charts. Recovered labels: axis label "FALSE ALARM RATE"; panel titles "MALE CUSTOMERS" and "FEMALE CUSTOMERS" (each appearing twice); system pairs C-C, A-A, V-V, A-C, V-C, V-A on the horizontal axis.]

Fig. 9 — False alarm and miss rates as a function of the system pair for male and
female customers for the human verification experiment.

VI. DISCUSSION

The major result of this investigation was the finding that, for an 
automatic speaker verification system, accuracy was statistically insensitive
to either adpcm coding or lpc vocoding of the speech utterance for 
either (or both) the reference and test utterances. For this same data 
base, the verification accuracy by human listeners was significantly 
affected by the different transmission systems in the manner described 
in Ref. 1. The conclusion drawn from these results is that pitch and 
gain are reasonably robust to the distortions of adpcm coding and lpc 
vocoding, thereby enabling the automatic system to be insensitive to 
these transmission systems. 

The overall error rate in this system was about 12 percent
for male talkers and 8 percent for female talkers. These error
rates are comparable to those obtained by Rosenberg in a large
experiment over dialed-up telephone lines. 2 As in the earlier work,
considerable variability in verification scores among talkers was found, 
again indicating that the variability of pitch and gain for some talkers 
is large, and thus for these talkers other feature sets should be 
considered for verification. Recent unpublished investigations by Furui 
indicate significantly smaller (on the order of 0.5 percent) error rates 
when the features are cepstral coefficient contours rather than pitch
and gain. Whether such feature sets are robust to coding and vocoding 
remains to be investigated. 

It was found in this investigation that the training methods used 
were inadequate for giving stable reference contours and reliable 
distance thresholds. This effect was previously noted by Rosenberg, 2 
Furui, 16 and Furui et al., 17 who showed that long-time variability in 
feature contours had to be taken into consideration to obtain stable 
reference data for verification. 

Two other effects were noted during the course of this study. First, 
it was found that selected distances provided significantly better scores 
than overall distances. Rosenberg also noted that, once a reasonable 
amount of training data was obtained, selected distances were better 
than overall distances. 2 Thus the results here indicate that 10 training 
utterances are sufficient for selected distance scores to be superior to 
overall distance scores. The second point concerned the lower error 
rates for female talkers than for male talkers. Rosenberg found no 
statistically significant differences between verification scores for 
males and females. 2 Thus, this result may be due to the difference in 
sentence used in the verification task. If this is true, then the impli- 
cation is that the test utterance chosen may provide small but consis- 
tent improvements in the verification scores. 

VII. SUMMARY 

In this paper, we have shown that, whereas the false alarm and miss 
rates for verification by human listeners are strongly affected by the 
pair of transmission systems used for the reference and test utterances, 
the false alarm and miss rates for an automatic verification system 
based on pitch and gain are relatively insensitive to the transmission 
system in the case of adpcm coding and lpc vocoding. Although the 
average overall error rate for this system was around 10 percent, the 
robustness of pitch and gain to transmission systems makes them 
attractive features for automatic speaker verification systems. 